Abstract

We address the rating-inference problem, 
wherein rather than simply decide whether 
a review is "thumbs up" or "thumbs 
down", as in previous sentiment analy- 
sis work, one must determine an author's 
evaluation with respect to a multi-point 
scale (e.g., one to five "stars"). This task 
represents an interesting twist on stan- 
dard multi-class text categorization be- 
cause there are several different degrees 
of similarity between class labels; for ex- 
ample, "three stars" is intuitively closer to 
"four stars" than to "one star". 

We first evaluate human performance at 
the task. Then, we apply a meta- 
algorithm, based on a metric labeling for- 
mulation of the problem, that alters a 
given n-ary classifier's output in an explicit
attempt to ensure that similar items
receive similar labels. We show that 
the meta-algorithm can provide signifi- 
cant improvements over both multi-class 
and regression versions of SVMs when we 
employ a novel similarity measure appro- 
priate to the problem. 

Publication info: Proceedings of the 
ACL, 2005. 

1 Introduction 

There has recently been a dramatic surge of interest
in sentiment analysis, as more and more people
become aware of the scientific challenges posed
and the scope of new applications enabled by the
processing of subjective language. (The papers
collected by Qu, Shanahan, and Wiebe (2004)
form a representative sample of research in the
area.) Most prior work on the specific problem of
categorizing expressly opinionated text has focused
on the binary distinction of positive vs. negative
(Turney, 2002; Pang, Lee, and Vaithyanathan, 2002;
Dave, Lawrence, and Pennock, 2003;
Yu and Hatzivassiloglou, 2003). But it is often
helpful to have more information than this binary
distinction provides, especially if one is ranking
items by recommendation or comparing several
reviewers' opinions: example applications include
collaborative filtering and deciding which
conference submissions to accept.

Therefore, in this paper we consider generaliz- 
ing to finer-grained scales: rather than just deter- 
mine whether a review is "thumbs up" or not, we 
attempt to infer the author's implied numerical rat- 
ing, such as "three stars" or "four stars". Note 
that this differs from identifying opinion strength 
(Wilson, Wiebe, and Hwa, 2004): rants and raves
have the same strength but represent opposite eval- 
uations, and referee forms often allow one to indi- 
cate that one is very confident (high strength) that 
a conference submission is mediocre (middling rat- 
ing). Also, our task differs from ranking not only 
because one can be given a single item to classify 
(as opposed to a set of items to be ordered relative to 
one another), but because there are settings in which 
classification is harder than ranking, and vice versa. 

One can apply standard n-ary classifiers or regression
to this rating-inference problem; independent
work by Koppel and Schler (2005) considers such
methods. But an alternative approach that explicitly
incorporates information about item similarities
together with label similarity information (for instance,
"one star" is closer to "two stars" than to
"four stars") is to think of the task as one of metric
labeling (Kleinberg and Tardos, 2002), where label
relations are encoded via a distance metric.
This observation yields a meta-algorithm, applicable
to both semi-supervised (via graph-theoretic techniques)
and supervised settings, that alters a given
n-ary classifier's output so that similar items tend to
be assigned similar labels.

In what follows, we first demonstrate that hu- 
mans can discern relatively small differences in (hid- 
den) evaluation scores, indicating that rating infer- 
ence is indeed a meaningful task. We then present 
three types of algorithms — one-vs-all, regression, 
and metric labeling — that can be distinguished by 
how explicitly they attempt to leverage similarity 
between items and between labels. Next, we con- 
sider what item similarity measure to apply, propos- 
ing one based on the positive-sentence percentage. 
Incorporating this new measure within the metric- 
labeling framework is shown to often provide sig- 
nificant improvements over the other algorithms. 

We hope that some of the insights derived here
might apply to other scales for text classification that
have been considered, such as clause-level opinion
strength (Wilson, Wiebe, and Hwa, 2004); affect
types like disgust (Subasic and Huettner, 2001;
Liu, Lieberman, and Selker, 2003); reading level
(Collins-Thompson and Callan, 2004); and urgency
or criticality (Horvitz, Jacobs, and Hovel, 1999).

2 Problem validation and formulation

We first ran a small pilot study on human subjects
in order to establish a rough idea of what a reasonable
classification granularity is: if even people cannot
accurately infer labels with respect to a five-star
scheme with half stars, say, then we cannot expect a
learning algorithm to do so. Indeed, some potential
obstacles to accurate rating inference include lack
of calibration (e.g., what an understated author intends
as high praise may seem lukewarm), author
inconsistency at assigning fine-grained ratings, and
ratings not entirely supported by the text. 1

Table 1: Human accuracy at determining relative
positivity. Rating differences are given in "notches".
Parentheses enclose the number of pairs attempted.

For data, we first collected Internet movie reviews 
in English from four authors, removing explicit rat- 
ing indicators from each document's text automati- 
cally. Now, while the obvious experiment would be 
to ask subjects to guess the rating that a review rep- 
resents, doing so would force us to specify a fixed 
rating-scale granularity in advance. Instead, we ex- 
amined people's ability to discern relative differ- 
ences, because by varying the rating differences rep- 
resented by the test instances, we can evaluate mul- 
tiple granularities in a single experiment. Specifi- 
cally, at intervals over a number of weeks, we au- 
thors (a non-native and a native speaker of English) 
examined pairs of reviews, attempting to determine
whether the first review in each pair was (1) more 
positive than, (2) less positive than, or (3) as posi- 
tive as the second. The texts in any particular review 
pair were taken from the same author to factor out 
the effects of cross-author divergence. 

As Table 1 shows, both subjects performed perfectly
when the rating separation was at least 3
"notches" in the original scale (we define a notch 
as a half star in a four- or five-star scheme and 10 
points in a 100-point scheme). Interestingly, al- 
though human performance drops as rating differ- 
ence decreases, even at a one-notch separation, both 
subjects handily outperformed the random-choice 
baseline of 33%. However, there was large variation 
in accuracy between subjects. 2 



1 For example, the critic Dennis Schwartz writes that "sometimes
the review itself [indicates] the letter grade should have
been higher or lower, as the review might fail to take into
consideration my overall impression of the film -- which I hope to
capture in the grade" (http://www.sover.net/~ozus/cinema.htm).

2 One contributing factor may be that the subjects viewed
disjoint document sets, since we wanted to maximize experimental
coverage of the types of document pairs within each difference
class. We thus cannot report inter-annotator agreement,
but since our goal is to recover a reviewer's "true" recommendation,
reader-author agreement is more relevant.

While another factor might be degree of English fluency, in
an informal experiment (six subjects viewing the same three
pairs), native English speakers made the only two errors.

Because of this variation, we defined two different
classification regimes. From the evidence
above, a three-class task (categories 0, 1, and 2,
essentially "negative", "middling", and "positive",
respectively) seems like one that most people would
do quite well at (but we should not assume 100%
human accuracy: according to our one-notch results,
people may misclassify borderline cases like 2.5
stars). Our study also suggests that people could
do at least fairly well at distinguishing full stars
in a zero- to four-star scheme. However, when we
began to construct five-category datasets for each of
our four authors (see below), we found that in each
case, either the most negative or the most positive
class (but not both) contained only about 5% of the
documents. To make the classes more balanced, we
folded these minority classes into the adjacent class,
thus arriving at a four-class problem (categories
0-3, increasing in positivity). Note that the four-class
problem seems to offer more possibilities for
leveraging class relationship information than the
three-class setting, since it involves more class pairs.
Also, even the two-category version of the rating-inference
problem for movie reviews has proven
quite challenging for many automated classification
techniques (Pang, Lee, and Vaithyanathan, 2002;
Turney, 2002).

We applied the above two labeling schemes to
a scale dataset 3 containing four corpora of movie
reviews. All reviews were automatically pre-processed
to remove both explicit rating indicators
and objective sentences; the motivation for the latter
step is that it has previously aided positive vs. negative
classification (Pang and Lee, 2004). All of the
1770, 902, 1307, or 1027 documents in a given corpus
were written by the same author. This decision
facilitates interpretation of the results, since it factors
out the effects of different choices of methods
for calibrating authors' scales. 4 We point out that
it is possible to gather author-specific information
in some practical applications: for instance, systems
that use selected authors (e.g., the Rotten Tomatoes
movie-review website, where, we note, not all
authors provide explicit ratings) could require that
someone submit rating-labeled samples of newly-admitted
authors' work. Moreover, our results at
least partially generalize to mixed-author situations
(see Section 5.2).

3 Available at http://www.cs.cornell.edu/People/pabo/movie-review-data
as scale dataset v1.0.

4 From the Rotten Tomatoes website's FAQ: "star systems
are not consistent between critics. For critics like Roger Ebert
and James Berardinelli, 2.5 stars or lower out of 4 stars is always
negative. For other critics, 2.5 stars can either be positive
or negative. Even though Eric Lurio uses a 5 star system, his
grading is very relaxed. So, 2 stars can be positive." Thus,
calibration may sometimes require strong familiarity with the
authors involved, as anyone who has ever needed to reconcile
conflicting referee reports probably knows.

3 Algorithms

Recall that the problem we are considering is multi-category
classification in which the labels can be
naturally mapped to a metric space (e.g., points on a
line); for simplicity, we assume the distance metric
$d(\ell, \ell') = |\ell - \ell'|$ throughout. In this section, we
present three approaches to this problem in order of
increasingly explicit use of pairwise similarity information
between items and between labels. In order
to make comparisons between these methods meaningful,
we base all three of them on Support Vector
Machines (SVMs) as implemented in Joachims'
(1999) SVMlight package.

3.1 One-vs-all

The standard SVM formulation applies only
to binary classification. One-vs-all (OVA)
(Rifkin and Klautau, 2004) is a common extension
to the n-ary case. Training consists of building,
for each label $\ell$, an SVM binary classifier distinguishing
label $\ell$ from "not-$\ell$". We consider the final
output to be a label preference function $\pi_{ova}(x, \ell)$,
defined as the signed distance of (test) item x to the
$\ell$ side of the $\ell$ vs. not-$\ell$ decision plane.

Clearly, OVA makes no explicit use of pairwise
label or item relationships. However, it can perform
well if each class exhibits sufficiently distinct language;
see Section 4 for more discussion.

3.2 Regression

Alternatively, we can take a regression perspective
by assuming that the labels come from a discretization
of a continuous function g mapping
from the feature space to a metric space. 5 If
we choose g from a family of sufficiently "gradual"
functions, then similar items necessarily receive
similar labels. In particular, we consider linear,
$\epsilon$-insensitive SVM regression (Vapnik, 1995;
Smola and Schölkopf, 1998); the idea is to find the
hyperplane that best fits the training data, but where
training points whose labels are within distance $\epsilon$ of
the hyperplane incur no loss. Then, for (test) instance
x, the label preference function $\pi_{reg}(x, \ell)$ is
the negative of the distance between $\ell$ and the value
predicted for x by the fitted hyperplane function.

Wilson, Wiebe, and Hwa (2004) used SVM regression
to classify clause-level strength of opinion,
reporting that it provided lower accuracy than other
methods. However, independently of our work,
Koppel and Schler (2005) found that applying linear
regression to classify documents (in a different
corpus than ours) with respect to a three-point rat- 
ing scale provided greater accuracy than OVA SVMs 
and other algorithms. 
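
The two label preference functions of Sections 3.1 and 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the linear weights and biases below are hypothetical stand-ins for trained SVM models (they are not SVMlight output), and the helper names `pi_ova`, `pi_reg`, and `best_label` are ours.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def pi_ova(x: Vector, label: int,
           classifiers: Dict[int, Tuple[Vector, float]]) -> float:
    """Signed distance of x to the `label` side of the
    label-vs-rest hyperplane w.x + b = 0."""
    w, b = classifiers[label]
    norm = math.sqrt(dot(w, w))
    return (dot(w, x) + b) / norm


def pi_reg(x: Vector, label: int, w: Vector, b: float) -> float:
    """Negative distance between the label and the regression
    prediction g(x) = w.x + b, using d(l, l') = |l - l'|."""
    return -abs(label - (dot(w, x) + b))


def best_label(labels: Sequence[int],
               pref: Callable[[int], float]) -> int:
    """Label with the highest preference value."""
    return max(labels, key=pref)


# Hypothetical three-class models over two features.
ova_models = {0: ([-1.0, 0.0], 0.5),
              1: ([0.0, 1.0], -1.0),
              2: ([1.0, 0.0], -1.5)}
reg_w, reg_b = [0.9, 0.0], 0.0
x = [2.0, 0.5]

ova_choice = best_label([0, 1, 2], lambda l: pi_ova(x, l, ova_models))
reg_choice = best_label([0, 1, 2], lambda l: pi_reg(x, l, reg_w, reg_b))
```

Taking the arg-max of either function recovers the base classifier's own decision; the metric-labeling step of Section 3.3 instead consumes the full preference values.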

3.3 Metric labeling 

Regression implicitly encodes the "similar items,
similar labels" heuristic, in that one can restrict
consideration to "gradual" functions. But we can
also think of our task as a metric labeling problem
(Kleinberg and Tardos, 2002), a special case
of the maximum a posteriori estimation problem
for Markov random fields, to explicitly encode our
desideratum. Suppose we have an initial label preference
function $\pi(x, \ell)$, perhaps computed via one
of the two methods described above. Also, let $d$
be a distance metric on labels, and let $\mathrm{nn}_k(x)$ denote
the $k$ nearest neighbors of item $x$ according
to some item-similarity function $\mathrm{sim}$. Then, it is
quite natural to pose our problem as finding a mapping
of instances $x$ to labels $\ell_x$ (respecting the original
labels of the training instances) that minimizes

$$\sum_{x \in \mathrm{test}} \Big[ -\pi(x, \ell_x) + \alpha \sum_{y \in \mathrm{nn}_k(x)} f\big(d(\ell_x, \ell_y)\big)\, \mathrm{sim}(x, y) \Big]$$

where $f$ is monotonically increasing (we chose
$f(d) = d$ unless otherwise specified) and $\alpha$ is a
trade-off and/or scaling parameter. (The inner summation
is familiar from work in locally-weighted
learning 6 (Atkeson, Moore, and Schaal, 1997).) In
a sense, we are using explicit item and label similarity
information to increasingly penalize the initial
classifier as it assigns more divergent labels to similar
items.

In this paper, we only report supervised-learning 
experiments in which the nearest neighbors for any 
given test item were drawn from the training set 
alone. In such a setting, the labeling decisions for 
different test items are independent, so that solving 
the requisite optimization problem is simple. 
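
Because the neighbors of a test item all carry fixed training labels in this supervised setting, the objective can be minimized one item at a time by trying each candidate label. A minimal sketch, assuming the preference values and the neighbor list have already been computed; the function name `metric_label` and the example numbers are illustrative, not taken from the experiments:

```python
from typing import Callable, Dict, Sequence, Tuple


def metric_label(pref: Dict[int, float],
                 neighbors: Sequence[Tuple[int, float]],
                 alpha: float,
                 f: Callable[[float], float] = lambda d: d) -> int:
    """Pick the label l minimizing
        -pi(x, l) + alpha * sum_y f(|l - l_y|) * sim(x, y)
    for a single test item x.

    pref      -- maps each candidate label l to pi(x, l)
    neighbors -- (l_y, sim(x, y)) for the k nearest training items
    alpha     -- trade-off parameter
    f         -- monotonically increasing transformation of the
                 label distance (identity by default)
    """
    def cost(label: int) -> float:
        penalty = sum(f(abs(label - l_y)) * s for l_y, s in neighbors)
        return -pref[label] + alpha * penalty

    return min(pref, key=cost)


# Hypothetical item: the base classifier slightly prefers label 2,
# but its most similar training neighbors are labeled 1 and 0.
pref = {0: -0.9, 1: 0.1, 2: 0.2}
neighbors = [(1, 0.9), (1, 0.8), (0, 0.7)]

without_neighbors = metric_label(pref, neighbors, alpha=0.0)
with_neighbors = metric_label(pref, neighbors, alpha=0.5)
```

With alpha set to 0 the base classifier's choice is returned unchanged; with alpha at 0.5 the neighbor penalty outweighs the small preference gap and the item is relabeled 1.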

Aside: transduction The above formulation also
allows for transductive semi-supervised learning,
in that we could allow nearest neighbors to
come from both the training and test sets. We
intend to address this case in future work, since
there are important settings in which one has a
small number of labeled reviews and a large number
of unlabeled reviews, in which case considering
similarities between unlabeled texts could prove
quite helpful. In full generality, the corresponding
multi-label optimization problem is intractable,
but for many families of $f$ functions (e.g., convex)
there exist practical exact or approximation
algorithms based on techniques for finding minimum
s-t cuts in graphs (Ishikawa and Geiger, 1998;
Boykov, Veksler, and Zabih, 1999; Ishikawa, 2003).
Interestingly, previous sentiment analysis research
found that a minimum-cut formulation for the binary
subjective/objective distinction yielded good results
(Pang and Lee, 2004). Of course, there are many
other related semi-supervised learning algorithms
that we would like to try as well; see Zhu (2005)
for a survey.

4 Class struggle: finding a label-correlated 
item-similarity function 

We need to specify an item similarity function sim
to use the metric-labeling formulation described in
Section 3.3. We could, as is commonly done, employ
a term-overlap-based measure such as the cosine
between term-frequency-based document vectors
(henceforth "TO(cos)"). However, Table 2
shows that in aggregate, the vocabularies of distant
classes overlap to a degree surprisingly similar to
that of the vocabularies of nearby classes. Thus,
item similarity as measured by TO(cos) may not correlate
well with similarity of the items' true labels.

Table 2: Average over authors and class pairs of
between-class vocabulary overlap as the class labels
of the pair grow farther apart.

5 We discuss the ordinal regression variant in Section 6.

6 If we ignore the $\pi(x, \ell)$ term, different choices of $f$ correspond
to different versions of nearest-neighbor learning, e.g.,
majority-vote, weighted average of labels, or weighted median
of labels.
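
For concreteness, the term-overlap cosine referred to as TO(cos) can be sketched as below. The whitespace tokenization and lowercasing are simplifying assumptions of this sketch, and the function name is ours; the preprocessing actually used for the experiments is not specified here.

```python
import math
from collections import Counter


def term_overlap_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine between raw term-frequency vectors of two documents."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    shared = tf_a.keys() & tf_b.keys()
    numerator = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document shares no terms with anything
    return numerator / (norm_a * norm_b)


same = term_overlap_cosine("a flat tedious film", "a flat tedious film")
disjoint = term_overlap_cosine("pleasant gem", "tedious mess")
```

Two reviews with very different ratings can still score highly under this measure when they share plot and genre vocabulary, which is the concern raised above.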

We can potentially develop a more useful simi- 
larity metric by asking ourselves what, intuitively, 
accounts for the label relationships that we seek 
to exploit. A simple hypothesis is that ratings 
can be determined by the positive-sentence per- 
centage (PSP) of a text, i.e., the number of posi- 
tive sentences divided by the number of subjective 
sentences. (Term-based versions of this premise 
have motivated much sentiment-analysis work for 
over a decade (Das and Chen, 2001; Tong, 2001;
Turney, 2002).) But counterexamples are easy to
construct: reviews can contain off-topic opinions, 
or recount many positive aspects before describing 
a fatal flaw. 

We therefore tested the hypothesis as follows. To 
avoid the need to hand-label sentences as positive or 
negative, we first created a sentence polarity dataset 
7 consisting of 10,662 movie-review "snippets" (a 
striking extract usually one sentence long) down- 
loaded from www.rottentomatoes.com; each snippet 
was labeled with its source review's label (positive 
or negative) as provided by Rotten Tomatoes. Then, 
we trained a Naive Bayes classifier on this data set 
and applied it to our scale dataset to identify the pos- 
itive sentences (recall that objective sentences were 
already removed). 

Figure 1 shows that all four authors tend to exhibit
a higher PSP when they write a more positive
review, and we expect that most typical reviewers
would follow suit. Hence, PSP appears to
be a promising basis for computing document similarity
for our rating-inference task. In particular,
we defined $\overrightarrow{PSP}(x)$ to be the two-dimensional vector
$(PSP(x), 1 - PSP(x))$, and then set the item-similarity
function required by the metric-labeling
optimization function (Section 3.3) to
$\mathrm{sim}(x, y) = \cos\big(\overrightarrow{PSP}(x), \overrightarrow{PSP}(y)\big)$. 8

[Figure 1 plot: "Positive-sentence percentage (PSP) statistics"; x-axis: rating (in notches)]



Figure 1: Average and standard deviation of PSP 
for reviews expressing different ratings. 
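
The PSP-based similarity can be sketched in a few lines. The sketch assumes each review has already been reduced to a list of per-sentence polarity decisions for its subjective sentences (in the paper these come from a Naive Bayes classifier); the function names are illustrative.

```python
import math
from typing import Sequence


def psp(sentence_is_positive: Sequence[bool]) -> float:
    """Positive-sentence percentage: positive sentences divided by
    the number of subjective sentences."""
    if not sentence_is_positive:
        raise ValueError("review has no subjective sentences")
    return sum(sentence_is_positive) / len(sentence_is_positive)


def psp_similarity(psp_x: float, psp_y: float) -> float:
    """Cosine between the vectors (PSP(x), 1 - PSP(x)) and
    (PSP(y), 1 - PSP(y))."""
    vx = (psp_x, 1.0 - psp_x)
    vy = (psp_y, 1.0 - psp_y)
    numerator = vx[0] * vy[0] + vx[1] * vy[1]
    # Norms cannot be zero: p^2 + (1 - p)^2 >= 0.5 for p in [0, 1].
    return numerator / (math.hypot(*vx) * math.hypot(*vy))


review_x = psp([True, True, False, True])     # 3 of 4 positive
review_y = psp([False, False, True, False])   # 1 of 4 positive
sim_xy = psp_similarity(review_x, review_y)
```

Reviews with equal PSP have similarity 1, and an all-positive review compared with an all-negative one has similarity 0.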

But before proceeding, we note that it is possi- 
ble that similarity information might yield no extra 
benefit at all. For instance, we don't need it if we 
can reliably identify each class just from some set 
of distinguishing terms. If we define such terms 
as frequent ones (n > 20) that appear in a sin- 
gle class 50% or more of the time, then we do find 
many instances; some examples for one author are: 
"meaningless", "disgusting" (class 0); "pleasant", 
"uneven" (class 1); and "oscar", "gem" (class 2) 
for the three-class case, and, in the four-class case, 
"flat", "tedious" (class 1) versus "straightforward", 
"likeable" (class 2). Some unexpected distinguish- 
ing terms for this author are "lion" for class 2 (three- 
class case), and for class 2 in the four-class case, 
"jennifer", for a wide variety of Jennifers. 

5 Evaluation 

This section compares the accuracies of the approaches
outlined in Section 3 on the four corpora
comprising our scale dataset. (Results using $L_1$ error
were qualitatively similar.) Throughout, when
we refer to something as "significant", we mean statistically
so with respect to the paired t-test, p < .05.

7 Available at http://www.cs.cornell.edu/People/pabo/movie-review-data
as sentence polarity dataset v1.0.

8 While admittedly we initially chose this function because
it was convenient to work with cosines, post hoc analysis revealed
that the corresponding metric space "stretched" certain
distances in a useful way.

The results that follow are based on SVMlight's
default parameter settings for SVM regression and
OVA. Preliminary analysis of the effect of varying
the regression parameter $\epsilon$ in the four-class case revealed
that the default value was often optimal.

The notation "A+B" denotes metric labeling
where method A provides the initial label preference
function $\pi$ and B serves as similarity measure. To
train, we first select the meta-parameters $k$ and $\alpha$
by running 9-fold cross-validation within the training
set. Fixing $k$ and $\alpha$ to those values yielding the
best performance, we then re-train A (but with SVM 
parameters fixed, as described above) on the whole 
training set. At test time, the nearest neighbors of 
each item are also taken from the full training set. 
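
The meta-parameter selection step can be sketched as a grid search. Everything specific here is an assumption of the sketch: the candidate grids, the interleaved fold assignment, and the `fold_accuracy` callback (a hypothetical function that trains A on the given training indices, runs metric labeling with the given k and alpha on the held-out indices, and returns accuracy).

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

FoldAccuracy = Callable[[List[int], List[int], int, float], float]


def select_meta_parameters(n_items: int,
                           k_grid: Sequence[int],
                           alpha_grid: Sequence[float],
                           fold_accuracy: FoldAccuracy,
                           n_folds: int = 9) -> Tuple[int, float]:
    """Return the (k, alpha) pair with the best mean held-out
    accuracy over n_folds folds of the training set."""
    # Interleaved folds: item i goes to fold i mod n_folds.
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    best_pair, best_acc = None, float("-inf")
    for k, alpha in product(k_grid, alpha_grid):
        accs = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n_items) if i not in held]
            accs.append(fold_accuracy(train, held_out, k, alpha))
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_pair, best_acc = (k, alpha), mean_acc
    return best_pair


# Toy callback whose accuracy peaks at k = 5, alpha = 0.5.
def toy_accuracy(train, held_out, k, alpha):
    return 1.0 - abs(k - 5) / 10 - abs(alpha - 0.5)


chosen = select_meta_parameters(90, [1, 5, 10], [0.1, 0.5, 1.0], toy_accuracy)
```

After the pair is chosen, the base method is re-trained on the whole training set, as described above.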

5.1 Main comparison 

Figure 2 summarizes our average 10-fold cross-validation
accuracy results. We first observe from
the plots that all the algorithms described in Section 3
always definitively outperform the simple baseline
of predicting the majority class, although the improvements
are smaller in the four-class case. Incidentally,
the data was distributed in such a way
that the absolute performance of the baseline itself
does not change much between the three- and
four-class case (which implies that the three-class
datasets were relatively more balanced); and Author
c's datasets seem noticeably easier than the others.

We now examine the effect of implicitly using la- 
bel and item similarity. In the four-class case, re- 
gression performed better than OVA (significantly 
so for two authors, as shown in the righthand ta- 
ble); but for the three-category task, OVA signifi- 
cantly outperforms regression for all four authors. 
One might initially interpret this "flip" as showing
that in the four-class scenario, item and label simi- 
larities provide a richer source of information rela- 
tive to class-specific characteristics, especially since 
for the non-majority classes there is less data avail- 
able; whereas in the three-class setting the categories 
are better modeled as quite distinct entities. 

However, the three-class results for metric labeling
on top of OVA and regression (shown in Figure 2
by black versions of the corresponding icons) show
that employing explicit similarities always improves
results, often to a significant degree, and yields the
best overall accuracies. Thus, we can in fact effectively
exploit similarities in the three-class case. Additionally,
in both the three- and four-class scenarios,
metric labeling often brings the performance of
the weaker base method up to that of the stronger
one (as indicated by the "disappearance" of upward
triangles in corresponding table rows), and never
hurts performance significantly.

In the four-class case, metric labeling and regres- 
sion seem roughly equivalent. One possible inter- 
pretation is that the relevant structure of the problem 
is already captured by linear regression (and per- 
haps a different kernel for regression would have 
improved its three-class performance). However, 
according to additional experiments we ran in the 
four-class situation, the test-set-optimal parameter 
settings for metric labeling would have produced 
significant improvements, indicating there may be 
greater potential for our framework. At any rate, we 
view the fact that metric labeling performed quite 
well for both rating scales as a definitely positive re- 
sult. 

5.2 Further discussion 

Q: Metric labeling looks like it's just combining 
SVMs with nearest neighbors, and classifier combi- 
nation often improves performance. Couldn't we get 
the same kind of results by combining SVMs with 
any other reasonable method? 
A: No. For example, if we take the strongest 
base SVM method for initial label preferences, but 
replace PSP with the term-overlap-based cosine 
(TO(cos)), performance often drops significantly. 
This result, which is in accordance with Section
4's data, suggests that choosing an item similarity
function that correlates well with label similarity 
is important. (ova+PSP «« ova+TO(cos) [3c] ; 
reg+PSP < reg+TO(cos) [4c]) 
Q: Could you explain that notation, please?
A: Triangles point toward the significantly better
algorithm for some dataset. For instance,
"M «> N [3c]" means, "In the 3-class task, method
M is significantly better than N for two author
datasets and significantly worse for one dataset (so
the algorithms were statistically indistinguishable on
the remaining dataset)". When the algorithms being
compared are statistically indistinguishable on
all four datasets (the "no triangles" case), we indicate
this with an equals sign ("=").

Figure 2: Results for main experimental comparisons.

Q: Thanks. Doesn't Figure 1 show that the
positive-sentence percentage would be a good
classifier even in isolation, so metric labeling isn't
necessary?
A: No. Predicting class labels directly from
the PSP value via trained thresholds isn't as
effective (ova+PSP «« threshold PSP [3c];
reg+PSP « threshold PSP [4c]).

Alternatively, we could use only the PSP component
of metric labeling by setting the label
preference function to the constant function
0, but even with test-set-optimal parameter settings,
doing so underperforms the trained metric
labeling algorithm with access to an initial
SVM classifier (ova+PSP «« 0+PSP* [3c];
reg+PSP « 0+PSP* [4c]).

Q: What about using PSP as one of the features for 
input to a standard classifier? 
A: Our focus is on investigating the utility of similarity
information. In our particular rating-inference
setting, it so happens that the basis for our pairwise
similarity measure can be incorporated as an
item-specific feature, but we view this as a tangential
issue. That being said, preliminary experiments
show that metric labeling can be better, barely
(for test-set-optimal parameter settings for both algorithms:
significantly better results for one author,
four-class case; statistically indistinguishable otherwise),
although one needs to determine an appropriate
weight for the PSP feature to get good performance.

Q: You defined the "metric transformation" function
$f$ as the identity function $f(d) = d$, imposing
greater loss as the distance between labels assigned
to two similar items increases. Can you do just as
well if you penalize all non-equal label assignments
by the same amount, or does the distance between
labels really matter?
A: You're asking for a comparison to the Potts
model, which sets $f$ to the function $f(d) = 1$ if
$d > 0$, 0 otherwise. In the one setting
in which there is a significant difference
between the two, the Potts model does worse
(ova+PSP < ova+PSP [3c]). Also, employing the
Potts model generally leads to fewer significant
improvements over a chosen base method (compare
Figure 2's tables with: reg+PSP < reg [3c];
ova+PSP « ova [3c]; ova+PSP = ova [4c]; but
note that reg+PSP < reg [4c]). We note that optimizing
the Potts model in the multi-label case is NP-hard,
whereas the optimal metric labeling with the
identity metric-transformation function can be efficiently
obtained (see Section 3.3).

Q: Your datasets had many labeled reviews and only 
one author each. Is your work relevant to settings 
with many authors but very little data for each? 
A: As discussed in Section 2, it can be quite difficult
to properly calibrate different authors' scales,
since the same number of "stars" even within what
is ostensibly the same rating system can mean different
things for different authors. But since you ask:
we temporarily turned a blind eye to this serious issue,
creating a collection of 5394 reviews by 496 authors
with at most 80 reviews per author, where we
pretended that our rating conversions mapped correctly
into a universal rating scheme. Preliminary
results on this dataset were actually comparable to
the results reported above, although since we are
not confident in the class labels themselves, more
work is needed to derive a clear analysis of this setting.
(Abusing notation, since we're already playing
fast and loose: [3c]: baseline 52.4%, reg 61.4%,
reg+PSP 61.5%, ova (65.4%) > ova+PSP (66.3%);
[4c]: baseline 38.8%, reg (51.9%) > reg+PSP
(52.7%), ova (53.8%) > ova+PSP (54.6%))

In future work, it would be interesting to deter- 
mine author-independent characteristics that can be 
used on (or suitably adapted to) data for specific au- 
thors. 

Q: How about trying — 

A: — Yes, there are many alternatives. A few 
that we tested are described in the Appendix, and 
we propose some others in the next section. We
should mention that we have not yet experimented
with all-vs.-all (AVA), another standard binary-to-multi-category
classifier conversion method, because
we wished to focus on the effect of omitting
pairwise information. In independent work on
3-category rating inference for a different corpus,
Koppel and Schler (2005) found that regression outperformed
AVA, and Rifkin and Klautau (2004) argue
that in principle OVA should do just as well as
AVA. But we plan to try it out.

6 Related work and future directions 

In this paper, we addressed the rating-inference 
problem, showing the utility of employing label sim- 
ilarity and (appropriate choice of) item similarity 
— either implicitly, through regression, or explicitly 
and often more effectively, through metric labeling. 

In the future, we would like to apply our
methods to other scale-based classification
problems, and explore alternative methods.
Clearly, varying the kernel in SVM regression
might yield better results. Another
choice is ordinal regression (McCullagh, 1980;
Herbrich, Graepel, and Obermayer, 2000), which
only considers the ordering on labels, rather
than any explicit distances between them; this
approach could work well if a good metric on
labels is lacking. Also, one could use mixture
models (e.g., combine "positive" and "negative"
language models) to capture class relationships
(McCallum, 1999; Schapire and Singer, 2000;
Takamura, Matsumoto, and Yamada, 2004).

We are also interested in framing multi-class but
non-scale-based categorization problems as metric
labeling tasks. For example, positive vs. negative
vs. neutral sentiment distinctions are sometimes
considered in which neutral means either objective
(Engström, 2004) or a conflation of objective with
a rating of mediocre (Das and Chen, 2001). (Koppel
and Schler (2005) in independent work also discuss
various types of neutrality.) In either case, we
could apply a metric in which positive and negative
are closer to objective (or objective+mediocre) than
to each other. As another example, hierarchical label
relationships can be easily encoded in a label metric.

Finally, as mentioned in Section 3.3, we would
like to address the transductive setting, in which one
has a small amount of labeled data and uses relationships
between unlabeled items, since it is particularly
well-suited to the metric-labeling approach
and may be quite important in practice.

A Appendix: other variations attempted

A.1 Discretizing binary classification

In our setting, we can also incorporate class relations
by directly altering the output of a binary classifier,
as follows. We first train a standard SVM, treating
ratings greater than 0.5 as positive labels and others
as negative labels. If we then consider the resulting
classifier to output a positivity-preference function
$\pi^+(x)$, we can then learn a series of thresholds to
convert this value into the desired label set, under
the assumption that the bigger $\pi^+(x)$ is, the more
positive the review. 9 This algorithm always outperforms
the majority-class baseline, but not to the degree
that the best of SVM OVA and SVM regression
does. Koppel and Schler (2005) independently
found in a three-class study that thresholding a positive/negative
classifier trained only on clearly positive
or clearly negative examples did not yield large
improvements.

9 This is not necessarily true: if the classifier's goal is to optimize
binary classification error, its major concern is to increase
confidence in the positive/negative distinction, which may not
correspond to higher confidence in separating "five stars" from
"four stars".

A.2 Discretizing regression

In our experiments with SVM regression, we discretized
regression output via a set of fixed decision
thresholds {0.5, 1.5, 2.5, ...} to map it into our set of
class labels. Alternatively, we can learn the thresholds
instead. Neither option clearly outperforms the
other in the four-class case. In the three-class set- 
ting, the learned version provides noticeably better 
performance in two of the four datasets. But these 
results taken together still mean that in many cases, 
the difference is negligible, and if we had started 
down this path, we would have needed to consider 
similar tweaks for one-vs-all SVM as well. We 
therefore stuck with the simpler version in order to 
maintain focus on the central issues at hand. 
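
The fixed-threshold discretization can be sketched as below. The function name and the tie-breaking rule (a value exactly on a threshold goes to the higher class) are assumptions of this sketch; learned thresholds could be supplied through the optional argument.

```python
import bisect
from typing import Optional, Sequence


def discretize(value: float, n_classes: int,
               thresholds: Optional[Sequence[float]] = None) -> int:
    """Map a real-valued regression output to a class in
    0 .. n_classes - 1 using sorted decision thresholds.

    With no thresholds given, the fixed set 0.5, 1.5, 2.5, ...
    is used, one threshold between each pair of adjacent labels.
    """
    if thresholds is None:
        thresholds = [i + 0.5 for i in range(n_classes - 1)]
    # Number of thresholds at or below the value = class index.
    return bisect.bisect_right(list(thresholds), value)


four_class = [discretize(v, 4) for v in (-3.0, 0.4, 1.2, 2.6, 9.0)]
```

Outputs below the first threshold or above the last one are clamped to the extreme classes automatically, since there are only n_classes - 1 thresholds.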



